


Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

Neural Information Processing Systems

Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model that is robust to changes in dynamics remains a challenge, since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is to update only the most accurate prediction head, so that each head specializes in a cluster of environments with similar dynamics. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into a context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the prediction head that is most accurate over recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods. Source code and videos are available at https://sites.google.com/view/trajectory-mcl.
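The trajectory-wise winner-take-all update described in the abstract can be sketched in a few lines: accumulate each head's prediction error over a whole trajectory (not per step) and update only the best head. The function names, linear toy heads, and squared-error loss below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def select_winning_head(heads, trajectory):
    """Pick the prediction head with the lowest total squared error
    over a whole trajectory (trajectory-wise, not per-step).
    Only this head would receive a gradient update."""
    errors = []
    for head in heads:
        err = sum(np.sum((head(s, a) - s_next) ** 2)
                  for s, a, s_next in trajectory)
        errors.append(err)
    return int(np.argmin(errors))

# Toy check: two linear "heads", one matching the true dynamics s' = 2s + a.
heads = [lambda s, a: 0.5 * s + a,   # wrong dynamics
         lambda s, a: 2.0 * s + a]   # correct dynamics
traj = [(np.array([1.0]), np.array([0.0]), np.array([2.0])),
        (np.array([2.0]), np.array([1.0]), np.array([5.0]))]
best = select_winning_head(heads, traj)  # head 1 fits this environment
```

Because the error is aggregated over the trajectory, a head cannot win on a single lucky step; it must model the environment's dynamics consistently, which is what drives the environment clustering.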





Review for NeurIPS paper: Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

Neural Information Processing Systems

Reviewers are in favor of acceptance after the discussion and I agree. The key novelty in this work is applying the Multiple Choice Learning framework to model-based reinforcement learning. Doing so allows the model to learn multimodal distributions over future states, and the authors provide strong empirical results. Neither dynamics learning nor MCL is novel; however, their combination is novel and demonstrated to be effective. The reviewers have left a number of useful suggestions for further strengthening the paper in terms of writing and experimentation, and I encourage the authors to make use of this feedback.




Constrained Hidden Markov Models

Neural Information Processing Systems

By thinking of each state in a hidden Markov model as corresponding to some spatial region of a fictitious topology space, it is possible to naturally define neighbouring states as those which are connected in that space. The transition matrix can then be constrained to allow transitions only between neighbours; this means that all valid state sequences correspond to connected paths in the topology space. I show how such constrained HMMs can learn to discover underlying structure in complex sequences of high-dimensional data, and apply them to the problem of recovering mouth movements from acoustics in continuous speech. Probabilistic unsupervised learning for such sequences requires models with two essential features: latent (hidden) variables and topology in those variables. Hidden Markov models (HMMs) can be thought of as dynamic generalizations of discrete-state static data models such as Gaussian mixtures, or as discrete-state versions of linear dynamical systems (LDSs) (which are themselves dynamic generalizations of continuous latent variable models such as factor analysis).
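The transition constraint amounts to masking the transition matrix by a neighbourhood graph and renormalising rows. A minimal sketch, where the `neighbours` mapping and the 1-D chain topology are illustrative assumptions:

```python
import numpy as np

def constrain_transitions(A, neighbours):
    """Zero out transitions between non-neighbouring states and
    renormalise each row, so every valid state sequence is a
    connected path in the topology space."""
    mask = np.zeros_like(A)
    for i, nbrs in neighbours.items():
        for j in nbrs:
            mask[i, j] = 1.0
    A = A * mask
    return A / A.sum(axis=1, keepdims=True)

# Illustrative 1-D chain topology: state i neighbours i-1, i, i+1.
n = 4
neighbours = {i: {j for j in (i - 1, i, i + 1) if 0 <= j < n}
              for i in range(n)}
A = constrain_transitions(np.full((n, n), 0.25), neighbours)
# Row 0 may only go to states 0 and 1; distant jumps get probability 0.
```

During EM training the masked entries stay at zero (a zero transition probability receives zero expected counts), so the topology constraint is preserved automatically.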


Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning

Wang, Junjie, Mu, Yao, Li, Dong, Zhang, Qichao, Zhao, Dongbin, Zhuang, Yuzheng, Luo, Ping, Wang, Bin, Hao, Jianye

arXiv.org Artificial Intelligence

The latent world model provides a promising way to learn policies in a compact latent space for tasks with high-dimensional observations; however, its generalization across diverse environments with unseen dynamics remains challenging. Although the recurrent structure utilized in current advances helps to capture local dynamics, modeling only state transitions without an explicit understanding of environmental context limits the generalization ability of the dynamics model. To address this issue, we propose a Prototypical Context-Aware Dynamics (ProtoCAD) model, which captures local dynamics through a time-consistent latent context and enables dynamics generalization in high-dimensional control tasks. ProtoCAD extracts useful contextual information with the help of prototypes clustered over the batch and benefits model-based RL in two ways: 1) it utilizes a temporally consistent prototypical regularizer that encourages the prototype assignments produced for different time parts of the same latent trajectory to be temporally consistent, rather than comparing the features directly; 2) it designs a context representation that combines both the projection embedding of latent states and aggregated prototypes, which can significantly improve the dynamics generalization ability. Extensive experiments show that ProtoCAD surpasses existing methods in terms of dynamics generalization. Compared with the recurrent-based model RSSM, ProtoCAD delivers 13.2% and 26.7% better mean and median performance across all dynamics generalization tasks.

Latent world models (Ha & Schmidhuber, 2018) summarize an agent's experience from high-dimensional observations to facilitate learning complex behaviors in a compact latent space. Current advances (Hafner et al., 2019; 2020; Deng et al., 2022) leverage Recurrent Neural Networks (RNNs) to extract historical information from high-dimensional observations as compact latent representations and enable imagination in the latent space.
However, modeling only latent state transitions without an explicit understanding of environmental context characteristics limits the dynamics generalization ability of the world model. Since changes in dynamics are not directly observable and can only be inferred from the observation sequence, dynamics generalization remains especially challenging for tasks with high-dimensional sensor inputs.
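The temporally consistent prototypical regularizer can be illustrated as a cross-entropy between the prototype assignments of two time segments of the same latent trajectory: the assignments, not the raw features, are encouraged to agree over time. The function names and toy dimensions below are illustrative assumptions, not ProtoCAD's actual loss:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_consistency_loss(z_early, z_late, prototypes):
    """Cross-entropy between the prototype assignments of two time
    segments of the same latent trajectory. Low when both segments
    are assigned to the same prototypes."""
    p = softmax(z_early @ prototypes.T)  # assignments of early segment
    q = softmax(z_late @ prototypes.T)   # assignments of late segment
    return float(-(q * np.log(p + 1e-8)).sum(axis=-1).mean())

# Toy check with 3 axis-aligned prototypes.
prototypes = np.eye(3)
z = np.array([[5.0, 0.0, 0.0]])
loss_same = temporal_consistency_loss(z, z, prototypes)
loss_diff = temporal_consistency_loss(z, np.array([[0.0, 5.0, 0.0]]), prototypes)
# loss_same < loss_diff: consistent assignments are rewarded.
```

Comparing assignment distributions rather than features lets the latent representation itself vary over time, so long as both segments map to the same environment prototype.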


Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment

Ball, Philip J., Lu, Cong, Parker-Holder, Jack, Roberts, Stephen

arXiv.org Artificial Intelligence

Reinforcement learning from large-scale offline datasets provides us with the ability to learn policies without potentially unsafe or impractical exploration. Significant progress has been made in the past few years in dealing with the challenge of correcting for differing behavior between the data collection and learned policies. However, little attention has been paid to potentially changing dynamics when transferring a policy to the online setting, where performance can drop by up to 90% for existing methods. In this paper we address this problem with Augmented World Models (AugWM). We augment a learned dynamics model with simple transformations that seek to capture potential changes in physical properties of the robot, leading to more robust policies. We not only train our policy in this new setting, but also provide it with the sampled augmentation as a context, allowing it to adapt to changes in the environment. At test time we learn the context in a self-supervised fashion by approximating the augmentation which corresponds to the new environment. We rigorously evaluate our approach on over 100 different changed-dynamics settings, and show that this simple approach can significantly improve the zero-shot generalization of a recent state-of-the-art baseline, often achieving successful policies where the baseline fails.

Offline reinforcement learning (RL) describes the problem setting where RL agents learn policies solely from previously collected experience without further interaction with the environment (12; 29).
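The two ingredients of the abstract, augmenting predicted dynamics during training and recovering the augmentation self-supervised at test time, can be sketched with a scalar-scale transformation. The scalar scale, function names, and least-squares fit are simplifying assumptions for illustration; the paper's augmentations and context inference need not take this form:

```python
import numpy as np

def augment_delta(model_delta, aug_scale):
    """Training-time augmentation (illustrative): rescale the model's
    predicted state delta to mimic changed physical properties, e.g.
    a heavier or lighter robot. The policy also sees aug_scale as context."""
    return aug_scale * model_delta

def infer_context(deltas_pred, deltas_real):
    """Test-time self-supervised context: least-squares scale mapping
    the model's predicted deltas onto the deltas actually observed
    in the new environment."""
    num = float(np.sum(deltas_pred * deltas_real))
    den = float(np.sum(deltas_pred ** 2))
    return num / den

# Toy check: the new environment scales all deltas by 1.5.
deltas_pred = np.array([1.0, 2.0, 3.0])
deltas_real = augment_delta(deltas_pred, 1.5)
ctx = infer_context(deltas_pred, deltas_real)  # recovers 1.5
```

Because the policy was trained with sampled scales as context, feeding it the inferred scale at test time lets it adapt zero-shot without any gradient updates.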